Capstone Project - Predicting House Prices in KING COUNTY¶
The purpose of this project is to build regression model and apply feature engineering methods to good accuracy ensuring less prediction errors for estimating house prices in KING COUNTY.The following are the attributes of the dataset
cida : notation for a house
dayhours : Date house was sold
price : Price is prediction target (Target Variable)
room_bed : Number of Bedrooms/House
room_bath : Number of bathrooms/bedrooms
living_measure : Square footage of the home
lot_measure : Square footage of the lot
ceil : Total floors (levels) in house
coast : House which has a view to a waterfront
sight : Has been viewed
condition : How good the condition is (Overall)
quality : grade given to the housing unit, based on grading system
ceil_measure : square footage of house apart from basement
basement_measure : square footage of the basement
yr_built : Built Year
yr_renovated : Year when house was renovated
zipcode : zip
lat : Latitude coordinate
long : Longitude coordinate
living_measure15 : room area in 2015(implies-- some renovations)
lot_measure15 : lotSize area in 2015(implies-- some renovations)
furnished : Based on the quality of room
total_area : Measure of both living and lot
1. Import Libraries and load dataset¶
It is always a good practice to eye-ball raw data to get a feel of the data in terms of number of structure of the file, number of attributes, types of attributes.
Contains 21613 rows with 23 columns. Data is as expected.
2. Data Cleaning and Exploration¶
Extract Year and Month from the column dayhours, add to the database and then drop the attribute dayhours¶
Calculate the age of the property and age after renovation before sold year¶
Contains 21613 rows with 26 columns
Few comments
------------
The attribute "room_bath" which represents " Number of Bath room" contains values in decimals which represents number of bathroom per bedroom. Example if we have 2-bed room house and there is 1 bathroom then effectively we could have 0.5 bathroom/bedroom.
The attribute "basement" is having 0 indicating not every house will have the basement
The attribute "price" is the Target variable which also known as dependent variable
Plot missing values (if any)¶
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0xa739198>
There are no missing values. Hence any sort of imputation is not required
Check for duplicates based on the feature "cid"¶
There are duplicate rows based on cid's. List the duplicate rows.
There are 353 duplicate rows. All these duplicate rows have same values except price.
Droping the duplicate rows
Check for zero values (if any)¶
Feature basement has the high(60%) percentage of zero values. However when clarified the attribute "basement" is having 0 indicates not every house will have the basement. Hence it is a valid value
The other feaatures such as furnished, yr-renovated, coast, sight will have no impact because for the feature
1. furnished - Valid option to represent status
2. coast - Valid option to represent status
3. sight - Valid option to represent status
4. yr-renovated - Valid option to represent status
Features such as room_bed and room_bath has very less % of zero values. WIll impute with the median value.Not considering the features cid and dayhours for Descriptive Stats as will not be of any help in giving value add information.
Now it's time to look at the data and see what, if anything, looks weird or notable.
Sale prices look more normally distributed, so it might behave well to the models
Exploring categorical features representing in numeric (Label encoders)¶
Majority of houses are not facing the waterfront
Majority of the houses are sold with no customer visits
Majority of the houses belong to condition 3 followed 4 & 5
Most of the houses belong to quality 7 followed by 8
3. Descriptive statistics of numeric columns¶
Out[31]:
|
count |
mean |
std |
min |
25% |
50% |
75% |
max |
| yr_built |
21436.0 |
1971.098433 |
29.385277 |
1900.0000 |
1952.0000 |
1975.000 |
1997.000 |
2.015000e+03 |
| yr_renovated |
21436.0 |
84.729800 |
402.431012 |
0.0000 |
0.0000 |
0.000 |
0.000 |
2.015000e+03 |
| house_age |
21436.0 |
43.223736 |
29.387271 |
-1.0000 |
17.0000 |
39.000 |
63.000 |
1.150000e+02 |
| age_after_renovtion |
21436.0 |
0.779903 |
4.891372 |
-1.0000 |
0.0000 |
0.000 |
0.000 |
8.000000e+01 |
| room_bed |
21436.0 |
3.371571 |
0.929205 |
0.0000 |
3.0000 |
3.000 |
4.000 |
3.300000e+01 |
| room_bath |
21436.0 |
2.117349 |
0.769913 |
0.0000 |
1.7500 |
2.250 |
2.500 |
8.000000e+00 |
| living_measure |
21436.0 |
2082.704936 |
919.146469 |
290.0000 |
1430.0000 |
1920.000 |
2550.000 |
1.354000e+04 |
| lot_measure |
21436.0 |
15135.637852 |
41538.620606 |
520.0000 |
5040.0000 |
7614.000 |
10696.250 |
1.651359e+06 |
| ceil |
21436.0 |
1.496198 |
0.540388 |
1.0000 |
1.0000 |
1.500 |
2.000 |
3.500000e+00 |
| coast |
21436.0 |
0.007604 |
0.086871 |
0.0000 |
0.0000 |
0.000 |
0.000 |
1.000000e+00 |
| sight |
21436.0 |
0.235118 |
0.767092 |
0.0000 |
0.0000 |
0.000 |
0.000 |
4.000000e+00 |
| condition |
21436.0 |
3.410384 |
0.650235 |
1.0000 |
3.0000 |
3.000 |
4.000 |
5.000000e+00 |
| quality |
21436.0 |
7.661737 |
1.174256 |
1.0000 |
7.0000 |
7.000 |
8.000 |
1.300000e+01 |
| ceil_measure |
21436.0 |
1790.960440 |
829.026491 |
290.0000 |
1200.0000 |
1560.000 |
2220.000 |
9.410000e+03 |
| basement |
21436.0 |
291.744495 |
442.781983 |
0.0000 |
0.0000 |
0.000 |
560.000 |
4.820000e+03 |
| zipcode |
21436.0 |
98077.862288 |
53.469371 |
98001.0000 |
98033.0000 |
98065.000 |
98117.000 |
9.819900e+04 |
| lat |
21436.0 |
47.560156 |
0.138601 |
47.1559 |
47.4711 |
47.572 |
47.678 |
4.777760e+01 |
| long |
21436.0 |
-122.213697 |
0.140896 |
-122.5190 |
-122.3280 |
-122.230 |
-122.124 |
-1.213150e+02 |
| living_measure15 |
21436.0 |
1988.314378 |
685.699093 |
399.0000 |
1490.0000 |
1840.000 |
2370.000 |
6.210000e+03 |
| lot_measure15 |
21436.0 |
12785.961280 |
27375.467469 |
651.0000 |
5100.0000 |
7620.000 |
10087.250 |
8.712000e+05 |
| furnished |
21436.0 |
0.197518 |
0.398136 |
0.0000 |
0.0000 |
0.000 |
0.000 |
1.000000e+00 |
| total_area |
21436.0 |
17218.342788 |
41706.504133 |
1423.0000 |
7031.5000 |
9575.000 |
13006.250 |
1.652659e+06 |
| price |
21436.0 |
541195.802435 |
367730.178278 |
75000.0000 |
323000.0000 |
450000.000 |
645000.000 |
7.700000e+06 |
Descriptive statistics is meant to know how data is distributed.What does descriptive statistics indicates
1. If mean not equal to Median(Q2) indicates the existance of tail.
If mean > median, long tail on right side. Skew on right side.
If mean < median, long tail on left side. Skew on left side
2. When we observe difference between (Q1,min) and (Q3,max) is not same data is not symmetric meaning
If difference between (Q1,min) > (Q3,max) indicates existance of long skewed tail on left side
If difference between (Q1,min) < (Q3,max) indicates existance of long skewed tail on right side
3. When we have longs skews, indicates existance of ourliers.
Based on the understanding what we can infer from the above is
Section A
---------
1. The first column represents count, which is number of records. In this case all features have no missing values
2. The second column mean value (average) of each column
3. The third column std value (standard deviation) of each column
4. The fourth column mean value (average) of each column
5. The fifth column min value (minimum) of each column
6. The sixth column 25% value (Q1-quartile) of each column
7. The sixth column 50% value (Q2-quartile) of each column. Also known as Median
8. The sixth column 75% value (Q3-quartile) of each column
9. The sixth column max value (maximum) of each column
Section B
---------
1. sold_year
Data normally distributed
2. room_bed
mean > median. Long tail on right side. Skew on right side.
3. room_bath
mean < median. Long tail on left side. Skew on left side.
4. living_measure
mean > median. Long tail on right side. Skew on right side.
5. lot_measure
mean > median. Long tail on right side. Skew on right side.
6. ceil
mean ~ median. Normally distributed
7. coast
mean = median. Normally distributed
8. sight
mean > median. Long tail on right side. Skew on right side
9. condition
mean > median. Long tail on right side. Skew on right side
10.quality
mean > median. Long tail on right side. Skew on right side
11.ceil_measure
mean > median. Long tail on right side. Skew on right side
12.basement
mean > median. Long tail on right side. Skew on right side
13.yr_built
mean < median. Long tail on left side. Skew on left side
14.yr_renovated
mean > median. Long tail on right side. Skew on right side
15.living_measure15
mean > median. Long tail on right side. Skew on right side
16.lot_measure15
mean > median. Long tail on right side. Skew on right side
17.furnished
mean ~ median. Normally distributed
18.total_area
mean > median. Long tail on right side. Skew on right side
Summary
-------
No Missing values
The data is not normally distributed. Deviating from Normal Distribution.
Outliers for almost all the features except for the features ceil, coast and furnished
Positive and Negative skewness has been observed. Futher analyze by visualizing the distribution
Will plot and check for outliers as well.
Drop the target variable¶
Visualizing the distribution of a dataset. Plot the variables and check for the distributions.¶
The target attribute price distribution¶
Price skewed on right side
Plotting univariate distributions¶
Negative Left skewed distribution. Median Closure to Q1. Mean > Median
('Mean :', 1971.0)
('Median :', 1975.0)
('25% :', 1952.0)
('75% :', 1997.0)
Positive Right skewed distribution. Median Closure to Q3. Mean > Median
('Mean :', 85.0)
('Median :', 0.0)
('25% :', 0.0)
('75% :', 0.0)
Positive Right skewed distribution. Median Closure to Q3. Mean > Median
('Mean :', 43.0)
('Median :', 39.0)
('25% :', 17.0)
('75% :', 63.0)
Positive Right skewed distribution. Median Closure to Q3. Mean > Median
('Mean :', 1.0)
('Median :', 0.0)
('25% :', 0.0)
('75% :', 0.0)
Unskewed distribution. Mean = Median
('Mean :', 3.0)
('Median :', 3.0)
('25% :', 3.0)
('75% :', 4.0)
Unskewed distribution. Mean = Median
('Mean :', 2.0)
('Median :', 2.0)
('25% :', 1.75)
('75% :', 2.5)
Positive Right skewed distribution. Median Closure to Q3. Mean > Median
('Mean :', 2083.0)
('Median :', 1920.0)
('25% :', 1430.0)
('75% :', 2550.0)
Positive Right skewed distribution. Median Closure to Q3. Mean > Median
('Mean :', 15136.0)
('Median :', 7614.0)
('25% :', 5040.0)
('75% :', 10696.25)
Negative Left skewed distribution. Median Closure to Q1. Mean > Median
('Mean :', 1.0)
('Median :', 2.0)
('25% :', 1.0)
('75% :', 2.0)
Unskewed distribution. Mean = Median
('Mean :', 0.0)
('Median :', 0.0)
('25% :', 0.0)
('75% :', 0.0)
Unskewed distribution. Mean = Median
('Mean :', 0.0)
('Median :', 0.0)
('25% :', 0.0)
('75% :', 0.0)
Unskewed distribution. Mean = Median
('Mean :', 3.0)
('Median :', 3.0)
('25% :', 3.0)
('75% :', 4.0)
Positive Right skewed distribution. Median Closure to Q3. Mean > Median
('Mean :', 8.0)
('Median :', 7.0)
('25% :', 7.0)
('75% :', 8.0)
Positive Right skewed distribution. Median Closure to Q3. Mean > Median
('Mean :', 1791.0)
('Median :', 1560.0)
('25% :', 1200.0)
('75% :', 2220.0)
Positive Right skewed distribution. Median Closure to Q3. Mean > Median
('Mean :', 292.0)
('Median :', 0.0)
('25% :', 0.0)
('75% :', 560.0)
Positive Right skewed distribution. Median Closure to Q3. Mean > Median
('Mean :', 98078.0)
('Median :', 98065.0)
('25% :', 98033.0)
('75% :', 98117.0)
Unskewed distribution. Mean = Median
('Mean :', 48.0)
('Median :', 48.0)
('25% :', 47.4711)
('75% :', 47.678000000000004)
Unskewed distribution. Mean = Median
('Mean :', -122.0)
('Median :', -122.0)
('25% :', -122.32799999999999)
('75% :', -122.124)
Positive Right skewed distribution. Median Closure to Q3. Mean > Median
('Mean :', 1988.0)
('Median :', 1840.0)
('25% :', 1490.0)
('75% :', 2370.0)
Positive Right skewed distribution. Median Closure to Q3. Mean > Median
('Mean :', 12786.0)
('Median :', 7620.0)
('25% :', 5100.0)
('75% :', 10087.25)
Unskewed distribution. Mean = Median
('Mean :', 0.0)
('Median :', 0.0)
('25% :', 0.0)
('75% :', 0.0)
Positive Right skewed distribution. Median Closure to Q3. Mean > Median
('Mean :', 17218.0)
('Median :', 9575.0)
('25% :', 7031.5)
('75% :', 13006.25)
Get the Outliers for each feature based on IQR Score
Out[60]:
|
yr_built |
yr_renovated |
house_age |
age_after_renovtion |
room_bed |
room_bath |
living_measure |
lot_measure |
ceil |
coast |
... |
ceil_measure |
basement |
zipcode |
lat |
long |
living_measure15 |
lot_measure15 |
furnished |
total_area |
price |
| False |
21436.0 |
21436.0 |
21436.0 |
21436.0 |
21436.0 |
21436.0 |
21335 |
6500 |
21436.0 |
21436.0 |
... |
21398 |
21436.0 |
0.0 |
21436.0 |
21436.0 |
21428 |
6496 |
21436.0 |
2957 |
0.0 |
| True |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
101 |
14936 |
0.0 |
0.0 |
... |
38 |
0.0 |
21436.0 |
0.0 |
0.0 |
8 |
14940 |
0.0 |
18479 |
21436.0 |
2 rows × 23 columns
False row indicates records within in IQR Range
True row indicates outliers which are outside IQR Range
('Median :', 1975.0)
('Q1 :', 1952.0)
('Q3 :', 1997.0)
('IQR :', 45.0)
('Max (U_whisker) :', 2064.5)
('Min (L_whisker) :', 1884.5)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21436
Name: yr_built, dtype: int64
('Median :', 0.0)
('Q1 :', 0.0)
('Q3 :', 0.0)
('IQR :', 0.0)
('Max (U_whisker) :', 0.0)
('Min (L_whisker) :', 0.0)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21436
Name: yr_renovated, dtype: int64
('Median :', 39.0)
('Q1 :', 17.0)
('Q3 :', 63.0)
('IQR :', 46.0)
('Max (U_whisker) :', 132.0)
('Min (L_whisker) :', -52.0)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21436
Name: house_age, dtype: int64
('Median :', 0.0)
('Q1 :', 0.0)
('Q3 :', 0.0)
('IQR :', 0.0)
('Max (U_whisker) :', 0.0)
('Min (L_whisker) :', 0.0)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21436
Name: age_after_renovtion, dtype: int64
('Median :', 3.0)
('Q1 :', 3.0)
('Q3 :', 4.0)
('IQR :', 1.0)
('Max (U_whisker) :', 5.5)
('Min (L_whisker) :', 1.5)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21436
Name: room_bed, dtype: int64
('Median :', 2.25)
('Q1 :', 1.75)
('Q3 :', 2.5)
('IQR :', 0.75)
('Max (U_whisker) :', 3.625)
('Min (L_whisker) :', 0.625)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21436
Name: room_bath, dtype: int64
('Median :', 1920.0)
('Q1 :', 1430.0)
('Q3 :', 2550.0)
('IQR :', 1120.0)
('Max (U_whisker) :', 4230.0)
('Min (L_whisker) :', -250.0)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21335
True 101
Name: living_measure, dtype: int64
('Median :', 7614.0)
('Q1 :', 5040.0)
('Q3 :', 10696.25)
('IQR :', 5656.25)
('Max (U_whisker) :', 19180.625)
('Min (L_whisker) :', -3444.375)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
True 14936
False 6500
Name: lot_measure, dtype: int64
('Median :', 1.5)
('Q1 :', 1.0)
('Q3 :', 2.0)
('IQR :', 1.0)
('Max (U_whisker) :', 3.5)
('Min (L_whisker) :', -0.5)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21436
Name: ceil, dtype: int64
('Median :', 0.0)
('Q1 :', 0.0)
('Q3 :', 0.0)
('IQR :', 0.0)
('Max (U_whisker) :', 0.0)
('Min (L_whisker) :', 0.0)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21436
Name: coast, dtype: int64
('Median :', 0.0)
('Q1 :', 0.0)
('Q3 :', 0.0)
('IQR :', 0.0)
('Max (U_whisker) :', 0.0)
('Min (L_whisker) :', 0.0)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21436
Name: sight, dtype: int64
('Median :', 3.0)
('Q1 :', 3.0)
('Q3 :', 4.0)
('IQR :', 1.0)
('Max (U_whisker) :', 5.5)
('Min (L_whisker) :', 1.5)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21436
Name: condition, dtype: int64
('Median :', 7.0)
('Q1 :', 7.0)
('Q3 :', 8.0)
('IQR :', 1.0)
('Max (U_whisker) :', 9.5)
('Min (L_whisker) :', 5.5)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21436
Name: quality, dtype: int64
('Median :', 1560.0)
('Q1 :', 1200.0)
('Q3 :', 2220.0)
('IQR :', 1020.0)
('Max (U_whisker) :', 3750.0)
('Min (L_whisker) :', -330.0)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21398
True 38
Name: ceil_measure, dtype: int64
('Median :', 0.0)
('Q1 :', 0.0)
('Q3 :', 560.0)
('IQR :', 560.0)
('Max (U_whisker) :', 1400.0)
('Min (L_whisker) :', -840.0)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21436
Name: basement, dtype: int64
('Median :', 98065.0)
('Q1 :', 98033.0)
('Q3 :', 98117.0)
('IQR :', 84.0)
('Max (U_whisker) :', 98243.0)
('Min (L_whisker) :', 97907.0)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
True 21436
Name: zipcode, dtype: int64
('Median :', 47.571999999999996)
('Q1 :', 47.4711)
('Q3 :', 47.678000000000004)
('IQR :', 0.20690000000000452)
('Max (U_whisker) :', 47.98835000000001)
('Min (L_whisker) :', 47.16074999999999)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21436
Name: lat, dtype: int64
('Median :', -122.23)
('Q1 :', -122.32799999999999)
('Q3 :', -122.124)
('IQR :', 0.20399999999999352)
('Max (U_whisker) :', -121.81800000000001)
('Min (L_whisker) :', -122.63399999999999)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21436
Name: long, dtype: int64
outliers on longitude should not be a concern
('Median :', 1840.0)
('Q1 :', 1490.0)
('Q3 :', 2370.0)
('IQR :', 880.0)
('Max (U_whisker) :', 3690.0)
('Min (L_whisker) :', 170.0)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21428
True 8
Name: living_measure15, dtype: int64
('Median :', 7620.0)
('Q1 :', 5100.0)
('Q3 :', 10087.25)
('IQR :', 4987.25)
('Max (U_whisker) :', 17568.125)
('Min (L_whisker) :', -2380.875)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
True 14940
False 6496
Name: lot_measure15, dtype: int64
('Median :', 0.0)
('Q1 :', 0.0)
('Q3 :', 0.0)
('IQR :', 0.0)
('Max (U_whisker) :', 0.0)
('Min (L_whisker) :', 0.0)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
False 21436
Name: furnished, dtype: int64
('Median :', 9575.0)
('Q1 :', 7031.5)
('Q3 :', 13006.25)
('IQR :', 5974.75)
('Max (U_whisker) :', 21968.375)
('Min (L_whisker) :', -1930.625)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
True 18479
False 2957
Name: total_area, dtype: int64
('Median :', 450000.0)
('Q1 :', 323000.0)
('Q3 :', 645000.0)
('IQR :', 322000.0)
('Max (U_whisker) :', 1128000.0)
('Min (L_whisker) :', -160000.0)
----------------------------------
Outliers count based on IQR Score:
----------------------------------
True 21436
Name: price, dtype: int64
Conclusion:
Most of these outliers represents high cost big houses. Hence data is genuine. WIll use Geo location analysis to plot the latitudes and longitudines on a map at the end of EDA.
Plotting bivariate distributions¶
It can also be useful to visualize a bivariate distribution of two variables. The easiest way to do this in seaborn by using jointplot() function. Shows the relationship between two variables along with the univariate distribution of each on separate axes.
### Scatterplots
The most familiar way to visualize a bivariate distribution is a scatterplot, where each observation is shown with point at the x and y values.
Out[88]:
<seaborn.axisgrid.JointGrid at 0x14066550>
Strength: Positive, Form: linear Relation: Strong correlation relationship
We can infer lesser the age of the house, high the price of the house
Strength: Positive, Form: Non linear Relation: Strong correlation relationship. Few outliers
We can infer price of the house is high on getting renovation. Atleast high cost houses
Out[90]:
<seaborn.axisgrid.JointGrid at 0xef89ac8>
Strength: Negative, Form: linear Relation: Strong correlation relationship. Outliers exists
We can infer price of the house coming down with the age of the house. Atleast high cost houses price towards south
Strength: Negative, Form: linear Relation: Strong correlation relationship. Few Outliers exists
We can infer recent renovations to the house fetching good price. As the age of renovation increases price trend going down
Out[92]:
<seaborn.axisgrid.JointGrid at 0x135cb4e0>
Out[93]:
<seaborn.axisgrid.JointGrid at 0x12e33940>
Strength: Positive, Form: linear Relation: strong correlation relationship. Outliers exists
We can infer ranging 2 to 4.5 bathrooms has good demand. Thereafrer price trend shows downwards
Out[94]:
<seaborn.axisgrid.JointGrid at 0x11084f28>
Strength: positive, Form: linear Relation: Strong correlation relationship. Outliers exists
We can infer area of 500 to 5000 Sqft has good demand and price up trend. Thereafrer price trend shows downwards
Out[95]:
<seaborn.axisgrid.JointGrid at 0x11419f98>
Strength: Negative, Form: linear Relation: Strong correlation relationship. Outliers exists
We can infer price demand decreases as increase in lot measure
Out[96]:
<seaborn.axisgrid.JointGrid at 0x14f5dcf8>
Strength: Postive, Form: linear Relation: Strong correlation relationship. Outliers exists
We can infer house with 1-2 floors has price up-trend and thereafter decreasing trend
Out[97]:
<seaborn.axisgrid.JointGrid at 0x14f48668>
Strength: Postive, Form: Non-linear Relation: Strong correlation relationship. Outliers exists
We can infer house price is uptrend irrespecive of house has a view to a waterfront or not
Out[98]:
<seaborn.axisgrid.JointGrid at 0x15dfa7b8>
Strength: Postive, Form: Non-linear Relation: Strong correlation relationship. Outliers exists
We can infer house price is uptrend irrespecive of number of sight visits. However high price houses has more sight visits
Out[99]:
<seaborn.axisgrid.JointGrid at 0x165e9cc0>
Strength: Postive, Form: Linear Relation: Strong correlation relationship. Outliers exists
We can infer house price is uptrend for every house condition. House whose condition 3 fetching high price followed by 4&5 condition houses
Out[100]:
<seaborn.axisgrid.JointGrid at 0x165e9e48>
Strength: Postive, Form: Linear Relation: Strong correlation relationship. Outliers exists
We can infer house price is in uptrend as increase in the quality grade of the house
Out[101]:
<seaborn.axisgrid.JointGrid at 0x1760c588>
Strength: positive, Form: linear Relation: Strong correlation relationship. Outliers exists
We can infer area of 500 to 5000 Sqft has good demand and price up trend. Thereafrer price trend shows downwards
Similar to the attribute living_measure
Out[102]:
<seaborn.axisgrid.JointGrid at 0x17d5f6d8>
Strength: positive, Form: linear Relation: Strong correlation relationship. Outliers exists
We can infer price increases with increase in the basement area
Out[103]:
<seaborn.axisgrid.JointGrid at 0x185dfc88>
Strength: positive, Form: linear Relation: Strong correlation relationship. Outliers exists
We can infer in certain areas of zipcode, the houses are priced high while in certain areas price is range bound
Out[104]:
<seaborn.axisgrid.JointGrid at 0x18dd3da0>
Strength: positive, Form: linear Relation: Strong correlation relationship. Outliers exists
We can infer the price trend is high as increase in Latitude coordinates of a geographic location
Out[105]:
<seaborn.axisgrid.JointGrid at 0x185dfc18>
Strength: positive, Form: linear Relation: Strong correlation relationship. Outliers exists
We can infer the price trend is high as decrease in longitude coordinates of a geographic location
Out[106]:
<seaborn.axisgrid.JointGrid at 0x19dbe908>
Strength: positive, Form: linear Relation: Strong correlation relationship. Outliers exists
We can infer area of 500 to 5000 Sqft has good demand and price up trend. Thereafrer price trend shows downwards
Similar to the attributes living_measure & ceil meaure
Out[107]:
<seaborn.axisgrid.JointGrid at 0x19dbe668>
Strength: Negative, Form: linear Relation: Strong correlation relationship. Outliers exists
We can infer price demand decreases as increase in lot size.
Similar to lot measure attribute
Out[108]:
<seaborn.axisgrid.JointGrid at 0x1aac8da0>
Strength: Positive, Form: Non-linear Relation: Strong correlation relationship. Outliers exists
We can infer price demand decreases for furnishes houses
Out[109]:
<seaborn.axisgrid.JointGrid at 0x1b5c8160>
Strength: Negative, Form: linear Relation: Strong correlation relationship. Outliers exists
We can infer price demand decreases as increase area of house increases
Similar to lot measure & lot measure15 attribute
Plotting Multi variate distributions¶
Out[110]:
<seaborn.axisgrid.PairGrid at 0x20d5bbe2d30>
The kernel density plot showns the signs of peaks or guassians. Seems about 5 peaks. Perform clustering to identify the number of peaks or gaussians.
Out[114]:
|
yr_built |
yr_renovated |
house_age |
age_after_renovtion |
room_bed |
room_bath |
living_measure |
lot_measure |
ceil |
coast |
... |
ceil_measure |
basement |
zipcode |
lat |
long |
living_measure15 |
lot_measure15 |
furnished |
total_area |
price |
| yr_built |
1.000000 |
-0.225697 |
-0.999874 |
-0.197002 |
0.153745 |
0.505092 |
0.316544 |
0.052447 |
0.489062 |
-0.026537 |
... |
0.422838 |
-0.134590 |
-0.345819 |
-0.148606 |
0.408531 |
0.324568 |
0.070307 |
0.304146 |
0.059212 |
0.051796 |
| yr_renovated |
-0.225697 |
1.000000 |
0.225309 |
0.752069 |
0.018435 |
0.051101 |
0.055084 |
0.007636 |
0.005626 |
0.093025 |
... |
0.022666 |
0.071909 |
0.064347 |
0.029206 |
-0.068736 |
-0.002677 |
0.007870 |
0.017308 |
0.008820 |
0.126803 |
| house_age |
-0.999874 |
0.225309 |
1.000000 |
0.196794 |
-0.153890 |
-0.505467 |
-0.316974 |
-0.052355 |
-0.489389 |
0.026471 |
... |
-0.423182 |
0.134340 |
0.345802 |
0.148125 |
-0.408494 |
-0.324886 |
-0.070299 |
-0.304524 |
-0.059130 |
-0.051748 |
| age_after_renovtion |
-0.197002 |
0.752069 |
0.196794 |
1.000000 |
-0.010110 |
-0.000704 |
0.019150 |
0.014421 |
-0.002249 |
0.104728 |
... |
0.008122 |
0.024546 |
0.046094 |
0.018029 |
-0.052013 |
-0.008548 |
0.011684 |
-0.012124 |
0.014785 |
0.067799 |
| room_bed |
0.153745 |
0.018435 |
-0.153890 |
-0.010110 |
1.000000 |
0.516772 |
0.577512 |
0.031719 |
0.175207 |
-0.006685 |
... |
0.478441 |
0.303035 |
-0.153621 |
-0.010228 |
0.129836 |
0.393161 |
0.029117 |
0.260252 |
0.044318 |
0.309011 |
| room_bath |
0.505092 |
0.051101 |
-0.505467 |
-0.000704 |
0.516772 |
1.000000 |
0.754422 |
0.087143 |
0.499863 |
0.063734 |
... |
0.684994 |
0.283539 |
-0.202958 |
0.023444 |
0.221935 |
0.568194 |
0.086627 |
0.484767 |
0.103419 |
0.524992 |
| living_measure |
0.316544 |
0.055084 |
-0.316974 |
0.019150 |
0.577512 |
0.754422 |
1.000000 |
0.171957 |
0.352867 |
0.103902 |
... |
0.876683 |
0.434419 |
-0.198621 |
0.051118 |
0.239214 |
0.756205 |
0.182775 |
0.632625 |
0.193303 |
0.701639 |
| lot_measure |
0.052447 |
0.007636 |
-0.052355 |
0.014421 |
0.031719 |
0.087143 |
0.171957 |
1.000000 |
-0.005950 |
0.021571 |
... |
0.182581 |
0.015107 |
-0.129411 |
-0.085992 |
0.229555 |
0.143650 |
0.718097 |
0.118133 |
0.999764 |
0.088672 |
| ceil |
0.489062 |
0.005626 |
-0.489389 |
-0.002249 |
0.175207 |
0.499863 |
0.352867 |
-0.005950 |
1.000000 |
0.023473 |
... |
0.522648 |
-0.246066 |
-0.058108 |
0.049279 |
0.124285 |
0.278854 |
-0.011917 |
0.347073 |
0.001851 |
0.255428 |
| coast |
-0.026537 |
0.093025 |
0.026471 |
0.104728 |
-0.006685 |
0.063734 |
0.103902 |
0.021571 |
0.023473 |
1.000000 |
... |
0.072021 |
0.080838 |
0.030558 |
-0.014395 |
-0.042188 |
0.086558 |
0.030694 |
0.069878 |
0.023774 |
0.266927 |
| sight |
-0.055153 |
0.103633 |
0.055174 |
0.090297 |
0.079542 |
0.186684 |
0.283753 |
0.075012 |
0.028211 |
0.403025 |
... |
0.166930 |
0.276482 |
0.087096 |
0.005671 |
-0.079720 |
0.279077 |
0.072867 |
0.219435 |
0.080963 |
0.396766 |
| condition |
-0.364947 |
-0.061136 |
0.364212 |
-0.006348 |
0.027896 |
-0.127839 |
-0.060973 |
-0.009298 |
-0.266543 |
0.016607 |
... |
-0.160782 |
0.174463 |
0.004509 |
-0.015402 |
-0.107780 |
-0.095002 |
-0.003731 |
-0.123725 |
-0.010604 |
0.035114 |
| quality |
0.445329 |
0.014357 |
-0.445767 |
-0.022653 |
0.358396 |
0.664727 |
0.762402 |
0.112541 |
0.457475 |
0.082841 |
... |
0.756068 |
0.167032 |
-0.182675 |
0.113332 |
0.196572 |
0.712507 |
0.118313 |
0.789051 |
0.128890 |
0.666817 |
| ceil_measure |
0.422838 |
0.022666 |
-0.423182 |
0.008122 |
0.478441 |
0.684994 |
0.876683 |
0.182581 |
0.522648 |
0.072021 |
... |
1.000000 |
-0.052456 |
-0.260420 |
-0.001652 |
0.343115 |
0.731992 |
0.193413 |
0.652087 |
0.201167 |
0.604964 |
| basement |
-0.134590 |
0.071909 |
0.134340 |
0.024546 |
0.303035 |
0.283539 |
0.434419 |
0.015107 |
-0.246066 |
0.080838 |
... |
-0.052456 |
1.000000 |
0.075283 |
0.109208 |
-0.145846 |
0.199247 |
0.017283 |
0.092320 |
0.024621 |
0.323812 |
| zipcode |
-0.345819 |
0.064347 |
0.345802 |
0.046094 |
-0.153621 |
-0.202958 |
-0.198621 |
-0.129411 |
-0.058108 |
0.030558 |
... |
-0.260420 |
0.075283 |
1.000000 |
0.267020 |
-0.564588 |
-0.278160 |
-0.146991 |
-0.137849 |
-0.133267 |
-0.051497 |
| lat |
-0.148606 |
0.029206 |
0.148125 |
0.018029 |
-0.010228 |
0.023444 |
0.051118 |
-0.085992 |
0.049279 |
-0.014395 |
... |
-0.001652 |
0.109208 |
0.267020 |
1.000000 |
-0.136369 |
0.047668 |
-0.086505 |
0.080779 |
-0.084519 |
0.306334 |
| long |
0.408531 |
-0.068736 |
-0.408494 |
-0.052013 |
0.129836 |
0.221935 |
0.239214 |
0.229555 |
0.124285 |
-0.042188 |
... |
0.343115 |
-0.145846 |
-0.564588 |
-0.136369 |
1.000000 |
0.333652 |
0.254280 |
0.186484 |
0.233903 |
0.019874 |
| living_measure15 |
0.324568 |
-0.002677 |
-0.324886 |
-0.008548 |
0.393161 |
0.568194 |
0.756205 |
0.143650 |
0.278854 |
0.086558 |
... |
0.731992 |
0.199247 |
-0.278160 |
0.047668 |
0.333652 |
1.000000 |
0.182441 |
0.619443 |
0.159738 |
0.584222 |
| lot_measure15 |
0.070307 |
0.007870 |
-0.070299 |
0.011684 |
0.029117 |
0.086627 |
0.182775 |
0.718097 |
-0.011917 |
0.030694 |
... |
0.193413 |
0.017283 |
-0.146991 |
-0.086505 |
0.254280 |
0.182441 |
1.000000 |
0.128713 |
0.719235 |
0.081807 |
| furnished |
0.304146 |
0.017308 |
-0.304524 |
-0.012124 |
0.260252 |
0.484767 |
0.632625 |
0.118133 |
0.347073 |
0.069878 |
... |
0.652087 |
0.092320 |
-0.137849 |
0.080779 |
0.186484 |
0.619443 |
0.128713 |
1.000000 |
0.131600 |
0.565128 |
| total_area |
0.059212 |
0.008820 |
-0.059130 |
0.014785 |
0.044318 |
0.103419 |
0.193303 |
0.999764 |
0.001851 |
0.023774 |
... |
0.201167 |
0.024621 |
-0.133267 |
-0.084519 |
0.233903 |
0.159738 |
0.719235 |
0.131600 |
1.000000 |
0.103778 |
| price |
0.051796 |
0.126803 |
-0.051748 |
0.067799 |
0.309011 |
0.524992 |
0.701639 |
0.088672 |
0.255428 |
0.266927 |
... |
0.604964 |
0.323812 |
-0.051497 |
0.306334 |
0.019874 |
0.584222 |
0.081807 |
0.565128 |
0.103778 |
1.000000 |
23 rows × 23 columns
- Year of renovation,number of bedrooms, bathrooms, overall square footage & sight visits, having basement and floors,
view towards waterfront, quality with furnished and goegraphic longitude are positively and most coorelated to the
price of the house
- Age of the house, overall square footage, number of bedrooms,bathrooms,floors, sight visits quality with goegraphic
longitude are positively and most coorelated to the furnishing of the house
- Overall square footage, number of bedrooms,bathrooms, quality and condition of the house are positively and coorelated
to the basement of the house
- Age of the house, overall square footage, number of bedrooms,bathrooms,floors & sight visits are positively and most
coorelated to the quality of the house
- Age of the house is positively coorelated to the condition of the house and year renovated
- Overall square footage, number of bathrooms and view towards waterfront are positively coorelated to the number of site
visits to finialize the house purchase
- Year of the house construction with overall square footage, number of bedrooms,bathrooms are positively coorelated to the
number of floors of the house
- Lot area of the house is positively coorelated to the over square footage of the house
- Year of the house construction with number of bedrooms,bathrooms are positively coorelated to the square footage of the house
- Year of the house construction is positively coorelated to the number of bedrooms & bathrooms
- Age of the house after renovation is positively coorelated to the year of renovation of the house
- Age of the house is negative and highly coorelated to the year of the house construction
- Age of the house is negatively coorelated to the number of bathrooms, number of floors of the house
- Age of the house is negatively coorelated to the quality of the house
- Age of the house is negatively coorelated to the furnishing of the house
- Condition of the house is negatively coorelated to the year of the house constructed and facing the waterfront
- Longitude is highly negative coorelated to the zip code
The independent variables number of bedrooms, bathrooms, overall square footage & sight visits, quality, furnished have strong positive correlation between dependent variable Price.